10 research outputs found

    DataHub: Collaborative Data Science & Dataset Version Management at Scale

    Relational databases have limited support for data collaboration, where teams collaboratively curate and analyze large datasets. Inspired by software version control systems like git, we propose (a) a dataset version control system, giving users the ability to create, branch, merge, difference and search large, divergent collections of datasets, and (b) a platform, DataHub, that gives users the ability to perform collaborative data analysis building on this version control system. We outline the challenges in providing dataset version control at scale. Comment: 7 pages

    Operationalizing Machine Learning: An Interview Study

    Organizations rely on machine learning engineers (MLEs) to operationalize ML, i.e., deploy and maintain ML pipelines in production. The process of operationalizing ML, or MLOps, consists of a continual loop of (i) data collection and labeling, (ii) experimentation to improve ML performance, (iii) evaluation throughout a multi-staged deployment process, and (iv) monitoring of performance drops in production. When considered together, these responsibilities seem staggering: how does anyone do MLOps, what are the unaddressed challenges, and what are the implications for tool builders? We conducted semi-structured ethnographic interviews with 18 MLEs working across many applications, including chatbots, autonomous vehicles, and finance. Our interviews expose three variables that govern success for a production ML deployment: Velocity, Validation, and Versioning. We summarize common practices for successful ML experimentation, deployment, and sustaining production performance. Finally, we discuss interviewees' pain points and anti-patterns, with implications for tool design. Comment: 20 pages, 4 figures

    Revisiting Prompt Engineering via Declarative Crowdsourcing

    Large language models (LLMs) are incredibly powerful at comprehending and generating data in the form of text, but are brittle and error-prone. There has been an advent of toolkits and recipes centered around so-called prompt engineering, the process of asking an LLM to do something via a series of prompts. However, for LLM-powered data processing workflows in particular, optimizing for quality while keeping cost bounded is a tedious, manual process. We put forth a vision for declarative prompt engineering. We view LLMs like crowd workers and leverage ideas from the declarative crowdsourcing literature (including leveraging multiple prompting strategies, ensuring internal consistency, and exploring hybrid LLM/non-LLM approaches) to make prompt engineering a more principled process. Preliminary case studies on sorting, entity resolution, and imputation demonstrate the promise of our approach.

    Waltzing binaries: Probing line-of-sight acceleration of merging compact objects with gravitational waves

    Line-of-sight acceleration of a compact binary coalescence (CBC) event would modulate the shape of the gravitational waves (GWs) it produces with respect to the corresponding non-accelerated CBC. Such modulations could be indicative of its astrophysical environment. We investigate the prospects of detecting this acceleration in future observing runs of the LIGO-Virgo-KAGRA network, as well as in next-generation (XG) detectors and the proposed DECIGO. We place the first observational constraints on this acceleration, for putative binary neutron star mergers GW170817 and GW190425. We find no evidence of line-of-sight acceleration in these events at $90\%$ confidence. Prospective constraints for the fifth observing run of the LIGO at A+ sensitivity suggest that accelerations for typical BNSs could be constrained with a precision of $a/c \sim 10^{-7}~[\mathrm{s}^{-1}]$, assuming a signal-to-noise ratio of $10$. These improve to $a/c \sim 10^{-9}~[\mathrm{s}^{-1}]$ in XG detectors, and $a/c \sim 10^{-16}~[\mathrm{s}^{-1}]$ in DECIGO. We also interpret these constraints in the context of mergers around supermassive black holes. Comment: Accepted to Ap

    Decibel: the relational dataset branching system

    As scientific endeavors and data analysis become increasingly collaborative, there is a need for data management systems that natively support the versioning or branching of datasets to enable concurrent analysis, cleaning, integration, manipulation, or curation of data across teams of individuals. Common practice for sharing and collaborating on datasets involves creating or storing multiple copies of the dataset, one for each stage of analysis, with no provenance information tracking the relationships between these datasets. This not only wastes storage but also makes it challenging to track and integrate modifications made by different users to the same dataset. In this paper, we introduce the Relational Dataset Branching System, Decibel, a new relational storage system with built-in version control designed to address these shortcomings. We present our initial design for Decibel and provide a thorough evaluation of three versioned storage engine designs that focus on efficient query processing with minimal storage overhead. We also develop an exhaustive benchmark to enable the rigorous testing of these and future versioned storage engine designs. National Science Foundation (U.S.) (1513972); National Science Foundation (U.S.) (1513407); National Science Foundation (U.S.) (1513443); Intel Science and Technology Center for Big Data

    Influence of plasma modification on mechanical and thermal properties of Polypropylene/Nano-Calcium Silicate Composites

    The aim of the research is to study the influence of plasma modification on nano calcium silicate/polypropylene composites. Polypropylene (PP) is chosen for this study because of its high impact strength, toughness and availability. Calcium silicate is chosen as the reinforcement because of its high temperature resistance, high flexural strength and high strength-to-mass ratio. Fourier transform infrared spectroscopy (FTIR) results show a change in the functional groups on the surface of calcium silicate after modification. Thermogravimetric analysis (TGA) and differential scanning calorimetry (DSC) results show that the decomposition temperature increases with increasing amounts of filler particles. The modification also produces a marginal increase in the decomposition and glass transition temperatures. Tensile test results show a gradual increase in the tensile properties of the composites at higher reinforcement ratios, and a marginal increase in tensile strength when the composites are reinforced with modified rather than non-modified calcium silicate. Scanning electron microscopy (SEM) reveals enhanced dispersion of the nanoparticles after modification. Based on these findings, it can be concluded that plasma modification marginally enhances the thermal and mechanical properties.

    Collaborative data analytics with DataHub

    While there have been many solutions proposed for storing and analyzing large volumes of data, all of these solutions have limited support for collaborative data analytics, especially given that many individuals and teams are simultaneously analyzing, modifying and exchanging datasets, employing a number of heterogeneous tools or languages for data analysis, and writing scripts to clean, preprocess, or query data. We demonstrate DataHub, a unified platform with the ability to load, store, query, collaboratively analyze, interactively visualize, interface with external applications, and share datasets. We will demonstrate the following aspects of the DataHub platform: (a) flexible data storage, sharing, and native versioning capabilities: multiple conference attendees can concurrently update the database, browse the different versions, and inspect conflicts; (b) an app ecosystem that hosts apps for various data-processing activities: conference attendees will be able to effortlessly ingest, query, and visualize data using our existing apps; (c) Thrift-based data serialization that permits data analysis in any combination of 20+ languages, with DataHub as the common data store: conference attendees will be able to analyze datasets in R, Python, and Matlab, while the inputs and the results are still stored in DataHub. In particular, conference attendees will be able to use the DataHub notebook, an IPython-based notebook for analyzing data and storing the results of data analysis.

    Hetero-bimetallic cooperative catalysis for the synthesis of heteroarenes
